CASSANDRA-21131: Fix CSV COPY TO/FROM corrupting text values containing backslashes #4813
Jens-G wants to merge 2 commits into
…ng backslashes

format_value_text in formatting.py doubles backslashes for terminal display (so SELECT output renders them visibly). When used via ExportProcess.format_value for COPY TO, this pre-escaping is applied before csv.writer runs its own backslash escaping (escapechar='\\'), resulting in quadrupled backslashes in the CSV file. On COPY FROM the csv.reader unescapes once, leaving doubled backslashes in Cassandra: data corruption that compounds on every round-trip.

The fix adds an escape_backslash parameter (default True, preserving existing terminal display behaviour) and passes escape_backslash=False from the CSV export path in ExportProcess.format_value. The parameter is propagated through format_simple_collection, format_value_list/set/tuple/map, and format_value_utype so that collection types (list<text>, set<text>, map<text,text>, UDTs) are covered as well.

Generated-by: Claude Sonnet 4.6 (Anthropic) with human review and direction
|
@Jens-G It seems the exporter sends values through the display formatter, which doubles backslashes for human-readable SELECT output, before handing them to the CSV writer. Claude suggests simply not using the display formatter for CSV export of text in copyutil.py. What do you think of that approach?
|
TBH if it works I'm fine with either approach 👍 🚀 |
    formatted = formatter(val, cqltype=cqltype,
                          encoding=self.encoding, colormap=NO_COLOR_MAP, date_time_format=self.date_time_format,
                          float_precision=cqltype.precision, nullval=self.nullval, quote=False,
                          escape_backslash=False,
Consider an alternative approach where format_value() bypasses display formatting for text:

    format_value():
        ...
        if cqltype.type_name in ('text', 'varchar', 'ascii'):
            return val if val.isprintable() else None
                          decimal_sep=self.decimal_sep, thousands_sep=self.thousands_sep,
                          boolean_styles=self.boolean_styles)
    return formatted
|
Hi @Jens-G, if you are busy I can add the unit tests for this and you can review them. Just let me know if that works.
|
@Jens-G the PR needs additional work. Based on our discussion of RFC 4180, proper text values don't require formatting. Handling that in format_value() may eliminate the need for a new escape_backslash argument.
… handling

Add TestExportFormatValue to test_copyutil.py, covering ExportProcess.format_value:

- scalar text/varchar/ascii values keep single backslashes on export
- text inside list/set/map/tuple keeps single backslashes (the collection formatters propagate escape_backslash, which a scalar-only type check misses)
- backslashes survive a csv.writer -> csv.reader round-trip with the COPY dialect
- the terminal-display path still doubles backslashes (escape_backslash default), preserving SELECT rendering

The tests were confirmed to fail when the escape_backslash=False export change is reverted, so they guard against regressing the fix.

Generated-by: Claude Opus 4.8 (Anthropic) with human review and direction
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
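To illustrate why a scalar-only type check misses collections, here is a minimal sketch. All four functions are hypothetical, heavily simplified stand-ins and not the real formatters in formatting.py or copyutil.py; they only model how the flag has to travel from the export path down to each element.

```python
TEXT_TYPES = ('text', 'varchar', 'ascii')


def format_text(val, escape_backslash=True):
    """Hypothetical text formatter: doubles backslashes unless told not to."""
    return val.replace('\\', '\\\\') if escape_backslash else val


def format_list(values, escape_backslash=True):
    """Hypothetical list<text> formatter that forwards the flag to each element."""
    items = (format_text(v, escape_backslash=escape_backslash) for v in values)
    return '[' + ', '.join("'%s'" % item for item in items) + ']'


def export_scalar_check_only(val, type_name):
    """Export path with only a scalar type check: collections still get display escaping."""
    if type_name in TEXT_TYPES:
        return val
    return format_list(val)


def export_with_flag(val, type_name):
    """Export path that passes escape_backslash=False down through the formatters."""
    if type_name in TEXT_TYPES:
        return format_text(val, escape_backslash=False)
    return format_list(val, escape_backslash=False)


values = ['a\\b']  # one element: a, backslash, b
print(export_scalar_check_only(values, 'list'))  # element backslash is doubled
print(export_with_flag(values, 'list'))          # element backslash stays single
```

In this model the scalar check handles a plain text column correctly, but a list of text values falls through to the display defaults, which is the case the collection tests are meant to guard.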
|
Thanks @bschoening! I worked through the scalar fast-path. It's a clean simplification for The ticket also covers I've added On Happy to add a scalar fast-path on top of the parameter if you'd like it for readability — as an addition rather than a replacement. WDYT? |
|
@arvindKandpal-ksolves thanks a lot for offering to help! I've gone ahead and added the unit tests in this PR ( |
|
@Jens-G Based on what we learned with CASSANDRA-21381, we should broaden the scope here to handle all TEXTDATA properly: COPY TO should allow every TEXTDATA value to round-trip. Backslash is just another character in the %x2D-7E range, a perfectly ordinary TEXTDATA character with no special status, so escape_backslash as an option isn't needed. The only escape production is DQUOTE.
BTW, sineemore/csv-test-data has test fixtures we could cherry-pick for unit tests. Relevant RFC 4180 grammar:
|
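For reference, RFC 4180 defines TEXTDATA = %x20-21 / %x23-2B / %x2D-7E and escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE, so a doubled double quote is the only escape and backslash (%x5C) is plain TEXTDATA. The sketch below shows what that would mean for a round-trip, using Python's csv module with doublequote=True and no escapechar. These dialect settings are an assumption chosen for illustration; they are not the options COPY uses today.

```python
import csv
import io

# Assumed RFC 4180 style dialect: quotes are doubled, no escape character at all.
RFC4180 = dict(delimiter=',', quotechar='"', doublequote=True,
               quoting=csv.QUOTE_MINIMAL, lineterminator='\r\n')


def write_row(fields):
    """Serialise one record to a CSV string."""
    buf = io.StringIO(newline='')
    csv.writer(buf, **RFC4180).writerow(fields)
    return buf.getvalue()


def read_row(text):
    """Parse the first record of a CSV string."""
    return next(csv.reader(io.StringIO(text, newline=''), **RFC4180))


row = ['V\\S', '\\"Marianne"\\', 'a,b']
encoded = write_row(row)
decoded = read_row(encoded)

print(encoded, end='')
print(decoded == row)
```

With this dialect the backslashes are written and read back unchanged, and only the fields containing a double quote or a comma are wrapped in quotes.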
Summary
COPY TO followed by COPY FROM corrupts text column values that contain backslashes: each round-trip doubles the backslash count. Reported in CASSANDRA-21131.

Before (one round-trip):

- V\S → exported CSV: V\\\\S → re-imported: V\\S ❌
- \"Marianne"\ → re-imported: \\"Marianne"\\ ❌
- list<text>, set<text>, map<text,text>, tuples and UDTs with text fields are affected in the same way.

Root Cause
format_value_text in formatting.py doubles backslashes unconditionally. This is intentional for terminal display (SELECT output shows V\\S so the backslash is visible). However, ExportProcess.format_value in copyutil.py calls the same function when writing CSV. The csv.writer (configured with escapechar='\\') then escapes backslashes a second time, quadrupling them in the CSV file. On COPY FROM the csv.reader unescapes once, leaving doubled backslashes in Cassandra.

Fix
Add an escape_backslash parameter (default True, preserving existing terminal display behaviour) to format_value_text, format_simple_collection, and all collection formatters. Pass escape_backslash=False from ExportProcess.format_value so that the csv.writer alone handles backslash escaping.

Changed functions:

- format_value_text: new parameter
- format_simple_collection: new parameter, propagated to element format_value calls
- format_value_list, format_value_set, format_value_tuple: new parameter, forwarded to format_simple_collection
- format_value_map: new parameter, propagated through subformat
- format_value_utype: new parameter, propagated through format_field_value
- ExportProcess.format_value in copyutil.py: passes escape_backslash=False

Test Plan
Two standalone Python test scripts (no running Cassandra cluster required) are attached to the JIRA ticket and verify the bug and fix:

- test_cassandra_21131.py: 10 test cases for plain text columns, 5/10 pass before fix → 10/10 after
- test_cassandra_21131_collections.py: 12 test cases for list/set/map<text>, 3/12 before → 12/12 after

Integration testing against a live cluster with the exact scenario from the bug report (COPY TO → TRUNCATE → COPY FROM → SELECT) is needed before merge.

Notes
- A related issue (UNICODE_CONTROLCHARS_RE converting control chars like \n to repr-notation \\n during CSV export) was discovered and will be tracked in a separate ticket.
- A Generated-by: commit token is included per ASF generative tooling policy. The fix was developed with AI assistance (Claude Sonnet 4.6 / Anthropic) under human review and direction. All code has been verified manually.

🤖 Generated with Claude Code